IRQL - Yet Another Language for Querying Semi-Structured Data?
Authors
Abstract
[Figure 5: abstract attributes. Labels in the diagram: document, article, source, complete_content, metadata, text, authors, title, year, sections, appendix, references. The example values shown are those of this paper: source http://e-lib.informatik.uni-rostock.de/2000/DBIS/IRQL/bncod.ps, authors Andreas Heuer and Denny Priebe, title "IRQL - Yet another language for querying semistructured data?", year 2000.]

indicates the full text of the original page. As Figure 5 shows, complete_content is an abstraction of a set of different attributes, e.g. metadata such as authors and text. Text, in turn, is another abstraction of further attributes such as the abstract or the references of the modelled article. In contrast to object-oriented or object-relational database models, the attributes complete_content and text are not tuple-valued attributes. For example, complete_content would consist of two different components, metadata and text, in the object models. Here, complete_content is considered as one text value again. The advantage of this kind of abstraction operator is its usability for information retrieval operations. If useful, the complete_content value can be seen as one atomic value. The problem that object-oriented and object-relational models (that are used as implementation models) do not support this kind of abstraction is hidden from the user: our abstraction operator is implemented on top of existing object-oriented and object-relational concepts.

4 Language

The aim of IRQL development is to integrate database query languages and information retrieval techniques. Similar to Lorel, our approach is to realize a query language in the style of SQL, but we additionally support information retrieval techniques by adding new clauses.
Like some of the query languages mentioned in Section 2, we change the type checking rules of SQL to also support querying semi-structured heterogeneous data. Because of these demands, we take the recently adopted SQL99 standard [SQL99a, SQL99b] as a starting point. We plan to implement a large subset of the proposed syntax in order to be able to answer queries conforming to this standard. Using examples, we subsequently show possibilities for querying semi-structured data and integrating information retrieval techniques into IRQL.

4.1 Structured and Semi-structured Data

The data model described in Section 3 supports querying structured and semi-structured data. Structured composite data are modelled as elements of the struct data type and are therefore subject to the strong type checking as found in e.g. SQL99. Semi-structured data are modelled as elements of a special data type (doc). We modify the type checking rules for instances of this data type³ so that meaningful queries are possible, even if the schema is not known or only partially known. These modifications include:

(1) Incompatible data types are cast to compatible types, if necessary.

(2) In heterogeneous data, non-existent attributes may be referenced. The concrete semantics depend on the type of operation used. For example, a non-existent attribute referenced in a projection is ignored for each tuple it does not appear in. Therefore, the operation's result is again heterogeneous. Selection predicates referencing such attributes are evaluated to false.

(3) Partially known schemata can be queried using path expressions and path variables.

For example, the query

    select r.name, ##z, r.cards from R2 r, r.price.{.*}z where ##z < 200

results in the heterogeneous instance shown in Figure 6. R2 denotes the instance from Figure 4.

    name    single  double
    Hübner  195     235

    name    single  double  twin  app.  cards
    Krüger  95      150     185   135   American Express, Visa, Diners Club

Figure 6: heterogeneous result
The elements of this set are typed as doc and are therefore subject to the type checking rules mentioned earlier. In the from clause, regular path expressions are used and the path variable z is declared. The regular path expression {.*} expands to all existing attributes below price. The appended (optional) label z denotes the corresponding variable name. The expression ##z dereferences the path variable z and is substituted by the complete paths (in this example R2.price.single, R2.price.double for the first tuple, as well as R2.price.single, R2.price.double, R2.price.twin and R2.price.app for the second tuple). The attribute cards referenced in the select clause is ignored while processing the first tuple as there is no such attribute there. If one or more prices are string types, these prices would have to be converted to numeric values to evaluate the predicate ##z < 200. If such a conversion is not possible, the predicate is evaluated to false.

³ Essentially, we adopt the techniques (primarily Lorel's) used in existing query languages for querying semi-structured data.

    name    price
            single  double
    Hübner  195     235
    Krüger  95      150

Figure 7: instance (R3)

The integration of structured and semi-structured data is realized by modelling these data as instances of different data types. For example, let the instance shown in Figure 7 be of type set<doc(name: string, price: struct(single: integer, double: integer))>. The query

    select r.price.single, r.address from R3 r

delivers the prices of all single rooms because the non-existent attribute address is directly ignored within a doc type. But the query

    select r.price.app, r.address from R3 r

leads to a runtime error because the price is modelled as a struct type where non-existent attributes may not be referenced. The difference between operations on structured and semi-structured data illustrated here for projection and selection can be adapted so that it holds for other operations, too.
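The relaxed type checking rules of Section 4.1 can be illustrated with a small Python sketch. It is not the IRQL implementation: the classes Doc and Struct, the MISSING sentinel, and the helpers lookup, project and less_than are hypothetical names introduced only for this illustration, and the instance R3 is a hand-written copy of Figure 7. The sketch assumes that a doc value tolerates references to non-existent attributes, that a struct value rejects them with a runtime error, and that a comparison on a value that cannot be cast to a number evaluates to false.

```python
class Doc(dict):
    """Semi-structured composite (hypothetical stand-in for the doc type)."""


class Struct(dict):
    """Strictly typed composite (hypothetical stand-in for the struct type)."""


MISSING = object()  # marks an attribute that does not exist in a doc value


def lookup(value, path):
    """Follow a dotted path; docs tolerate missing labels, structs do not."""
    for label in path.split("."):
        if isinstance(value, Struct):
            if label not in value:
                raise AttributeError(f"struct has no attribute {label!r}")
            value = value[label]
        elif isinstance(value, Doc):
            if label not in value:
                return MISSING
            value = value[label]
        else:
            return MISSING  # an atomic value has no sub-attributes
    return value


def project(instance, paths):
    """Projection: a path missing in a doc tuple is ignored for that tuple."""
    result = []
    for tup in instance:
        row = {}
        for path in paths:
            value = lookup(tup, path)
            if value is not MISSING:
                row[path] = value
        result.append(row)
    return result


def less_than(value, bound):
    """Selection predicate: cast to a number if possible, otherwise false."""
    if value is MISSING:
        return False
    try:
        return float(value) < bound
    except (TypeError, ValueError):
        return False


# Instance R3 from Figure 7: doc tuples whose price is a struct.
R3 = [
    Doc(name="Hübner", price=Struct(single=195, double=235)),
    Doc(name="Krüger", price=Struct(single=95, double=150)),
]

single_prices = project(R3, ["price.single", "address"])
```

Here project(R3, ["price.single", "address"]) silently drops address, mirroring the first query on R3, whereas project(R3, ["price.app", "address"]) raises an error because app is looked up inside a struct, mirroring the second query.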
4.2 Extensions

In IRQL, there is one basic extension describing all known documents. We call this extension d_world. In order to avoid considering all these documents in every query, we introduce some possibilities to create further (e.g. smaller) extensions. Useful criteria include (1) information about how or whether a document can be reached via a particular path, (2) the language of documents, (3) the possibly named document types, and (4) the document's domain. We express each of these as an extension to the from clause.

The extension (coll) of documents that are reachable starting from a given URL is determined by

    <coll> REACHABLE FROM <URL> [DEPTH <value>] [LOCAL]

As options, a maximum path length (DEPTH parameter) can be specified or only local documents can be chosen. All documents in a given language can be determined by

    <coll> IN LANGUAGE <lang>

The clause

    <coll> OF [NAMED] TYPE <tycon>

creates a document extension for a given, possibly named, type. If the keyword NAMED is omitted, tycon stands for a type constructor, otherwise it is a label like postscript (PS). Finally,

    <coll> OF DOMAIN <domain>

determines the extension of all documents of a given domain (e.g. tourism).

4.3 Information Retrieval

In the following, we describe a further extension to SQL99, namely predicates that implement information retrieval techniques (e.g. content-based retrieval, soundex and proximity search, term weighting and ranking of query results). These possibilities are particularly suited for, but not limited to, semi-structured data.

4.3.1 Content-based Retrieval

We denote content-based retrieval by the clause

    <attribute> CONTAINS <text> [ATLEAST <value>] [ATMOST <value>] [WITH WEIGHT <value>] [CASE SENSITIVE] [SUBSTRING] [(<value> | NO) ERRORS]

The following optional parameters exist: (1) ATLEAST and ATMOST specify how often text must occur in attribute. If the number of occurrences of text is not within the specified bounds, the predicate is evaluated to false.
If one or both of the parameters are omitted, no limit is assumed. (2) WITH WEIGHT specifies the weight of the query term. The default value is a weight of one. (3) By default, the search is case insensitive. This can be changed by specifying the CASE SENSITIVE parameter. (4) SUBSTRING specifies that matching is not restricted to word boundaries (e.g. spaces) but finds any occurrence of the given substring. (5) There is also a possibility for considering typing errors (see glimpse [Har] for how this can be realized) by specifying a value for the ERRORS parameter. By default, no typing errors are considered.

The values of text can be either keywords or phrases. Furthermore, we support regular expressions (wildcards) here.

4.3.2 Soundex

The soundex algorithm allows the search for phonetically similar keywords or phrases. We denote the soundex search by

    <attribute> SOUNDEX <text> [ATLEAST <value>] [ATMOST <value>]

The meaning of ATLEAST and ATMOST can be taken from Section 4.3.1. Further parameters mentioned there are not meaningful within the context of a soundex search.

4.3.3 Proximity

The next supported concept of content-based retrieval is the proximity search. Using a proximity search, it is possible to specify the distance between two keywords or phrases. The denotation is as follows:

    <attribute> CONTAINS <text> [WITH WEIGHT <value>] [<value> <unit>] BEFORE | AFTER <text> [WITH WEIGHT <value>] [ATLEAST <value>] [ATMOST <value>] [CASE SENSITIVE] [SUBSTRING] [(<value> | NO) ERRORS]

Here, we only describe the new parameters. The others can be found in Section 4.3.1. The new parameter unit can be substituted by a type-dependent unit. For example, if d is a LaTeX document and a method exists to split this document into sections, then

    d CONTAINS "related work" 2 SECTIONS BEFORE "conclusion"

is a valid predicate that checks whether d contains the phrase "related work" not more than 2 sections before the keyword "conclusion".

4.3.4 Ranking

We support ranking results by user-defined criteria.
Syntactically, this is denoted by

    RANK BY f_0, ..., f_n [LIMIT TO <value>]

The f_i denote user-defined functions that define the calculation of the retrieval status value (RSV). The RSV is an attribute that is appended by the rank by clause and, after calculating this value, the result is sorted by RSV. Although we next plan to support the vector space model, we don't need to change our syntax if we implement a probabilistic model, as the following example demonstrates:

    SELECT RSV, name FROM hotels d RANK BY d.stars=5, d.beachdist=0

In this query we define a ranking using boolean predicates. These predicates are not evaluated to true or false, but define the "best" hotel. Thus, the retrieval status value of one is assigned to a five-star hotel directly situated at the beach. Using probabilistic methods, the other hotels are ranked accordingly. The optional part of the rank by clause is used to limit the number of returned elements to value. By default, the number of elements is unlimited.

4.3.5 Compatibility with DBQLs and Information Retrieval

On the one hand, compatibility with SQL is achieved if there are no semi-structured data, and therefore no doc type data, in any of the extensions queried. In this case, any query that is a valid query within the supported subset of SQL99 is also a valid IRQL query and delivers the same result. On the other hand, compatibility with information retrieval expressions is achieved by transparently mapping these expressions to IRQL queries, as the following example demonstrates: Assume we are interested in a hotel near the beach. Using one of the search engines, we would probably enter hotel and beach. This expression is also accepted by IRQL and transparently mapped to⁴

    select <default projection> from <default extension> where <default attribute> CONTAINS "hotel" AND <default attribute> CONTAINS "beach"

The default values can be adjusted within the IRQL shell.
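The mapping from a search-engine style expression to an IRQL query can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the IRQL shell's actual code: the function name keywords_to_irql is hypothetical, the sketch handles only keywords joined by "and", and it ignores the ranking, as the text does. The projection, extension and attribute are passed in explicitly because the text treats them as adjustable defaults; the values used in the call are the ones the text itself proposes.

```python
def keywords_to_irql(expression, projection, extension, attribute):
    """Map a keyword expression such as 'hotel and beach' to an IRQL query.

    Hypothetical helper: only 'and'-connected single keywords are handled.
    """
    # Drop the connective; every remaining token is treated as a keyword.
    terms = [t for t in expression.split() if t.lower() != "and"]
    if not terms:
        raise ValueError("expression contains no keywords")
    # One CONTAINS predicate per keyword, combined conjunctively.
    predicates = [f'{attribute} CONTAINS "{term}"' for term in terms]
    return (
        f"select {projection} from {extension} where "
        + " AND ".join(predicates)
    )


query = keywords_to_irql(
    "hotel and beach",
    projection="source,title",
    extension="d_world",
    attribute="complete_content",
)
```

Keeping the three defaults as parameters reflects the statement that they can be changed in the IRQL shell without touching the mapping itself.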
A good choice would be to use source,title as default projection, d_world as default extension, and complete_content as default attribute. In this simple example the resulting query

    select source,title from d_world where complete_content CONTAINS "hotel" AND complete_content CONTAINS "beach"

delivers the expected information.

5 Applications

The IRQL language is used in two different projects:

In the BlueView project⁵, digital library services are developed and partially implemented based on the architecture of virtual document servers. Using standard tools like full-text database or information retrieval systems, object-relational database management systems, and replication and caching services, different heterogeneous local document servers have been integrated into one local server. IRQL is the query language for this integrated local document server because it can be implemented on top of these different platforms.

In the GETESS project⁶, we analyse web documents using linguistic and domain-specific knowledge [SBB+99a, SBB+99b] and use the data gathered in this way to answer user queries. Here IRQL serves as an internal query language. Within this context, we use some default attributes if the corresponding data originated in web documents: e.g. the attribute "source" points to the URL where the data are gathered from; the attribute "complete_content" contains the full text of the original page.

⁴ For simplicity, we ignore the ranking.
⁵ http://wwwdb.informatik.uni-rostock.de/blueview
⁶ http://www.getess.de

6 Conclusion and Future Work

In this paper we present our approach for developing an Information Retrieval Query Language (IRQL). The data model used distinguishes structured and semi-structured heterogeneous data based on type information and supports an abstraction of attribute names. IRQL integrates concepts of database query languages, query languages for semi-structured data, and information retrieval techniques.
The starting point of IRQL development is SQL99, which we extend with new clauses to integrate information retrieval techniques. Furthermore, we modify the type system to support semi-structured heterogeneous data. IRQL⁷ is built on top of existing systems such as object-relational DBMSs, relational DBMSs, or full-text DBMSs. The current prototype implementation has been built on top of DB2 and its text extender. To the best of our knowledge, there is no similar proposal that attempts to integrate features of these three areas. Future work includes the complete formalization of the query language and the development of an algebra.

⁷ http://wwwdb.informatik.uni-rostock.de/irql

References

[Abi97] Serge Abiteboul. Querying Semi-Structured Data. In Foto N. Afrati and Phokion Kolaitis, editors, Database Theory ICDT '97, 6th International Conference, volume 1186 of Lecture Notes in Computer Science, pages 1-18, Delphi, Greece, January 1997. Springer Verlag.

[AM98] Gustavo O. Arocena and Alberto O. Mendelzon. WebOQL: Restructuring Documents, Databases, and Webs. In Proceedings of the Fourteenth International Conference on Data Engineering, pages 24-33, Orlando, Florida, USA, February 1998. IEEE Computer Society Press.

[AMM97] Gustavo O. Arocena, Alberto O. Mendelzon, and George A. Mihaila. Applications of a Web Query Language. In Proceedings of the 6th International WWW Conference, Santa Clara, California, 1997.

[AQM+97] Serge Abiteboul, Dallan Quass, Jason McHugh, Jennifer Widom, and Janet L. Wiener. The Lorel Query Language for Semistructured Data. International Journal on Digital Libraries, 1(1):68-88, 1997.

[BDHS96] Peter Buneman, Susan B. Davidson, Gerd G. Hillebrand, and Dan Suciu. A Query Language and Optimization Techniques for Unstructured Data. In H. V. Jagadish and Inderpal Singh Mumick, editors, Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, volume 25(2) of SIGMOD Record, pages 505-516, Montreal, Quebec, Canada, June 1996.
[Clu97] Sophie Cluet. Modeling and Querying Semi-Structured Data. In Maria Teresa Pazienza, editor, Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology, International Summer School, SCIE-97, volume 1299 of Lecture Notes in Computer Science, pages 192-213, Frascati, Italy, 1997. Springer Verlag.

[FFLS97] Mary F. Fernandez, Daniela Florescu, Alon Y. Levy, and Dan Suciu. A Query Language for a Web-Site Management System. In SIGMOD Record, volume 26(3), pages 4-11, 1997.

[FLM98] Daniela Florescu, Alon Y. Levy, and Alberto O. Mendelzon. Database Techniques for the World-Wide Web: A Survey. In SIGMOD Record, volume 27(3), pages 59-74, 1998.

[GMW99] Roy Goldman, Jason McHugh, and Jennifer Widom. From Semistructured Data to XML: Migrating the Lore Data Model and Query Language. In Sophie Cluet and Tova Milo, editors, ACM SIGMOD Workshop on The Web and Databases (WebDB'99), pages 25-30, Philadelphia, Pennsylvania, USA, June 1999. INRIA. Informal Proceedings.

[Har] Harvest. http://harvest.transarc.com.

[KS95] David Konopnicki and Oded Shmueli. W3QS: A Query System for the World-Wide Web. In Umeshwar Dayal, Peter M. D. Gray, and Shojiro Nishio, editors, VLDB'95, Proceedings of 21th International Conference on Very Large Data Bases, pages 54-65, Zurich, Switzerland, September 1995. Morgan Kaufmann Publishers.

[LSCH98] Wen-Syan Li, Junho Shim, K. Selçuk Candan, and Yoshinori Hara. WebDB: A Web Query System and its Modeling, Language, and Implementation. In Proceedings of the IEEE Forum on Research and Technology Advances in Digital Libraries, IEEE ADL'98, pages 216-227, Santa Barbara, CA, USA, April 1998.

[LSS96] Laks V. S. Lakshmanan, Fereidoon Sadri, and Iyer N. Subramanian. A Declarative Language for Querying and Restructuring the WEB. In Proceedings: Sixth International Workshop on Research Issues in Data Engineering: Interoperability of Nontraditional Database Systems, IEEE-CS 1996, pages 12-21, New Orleans, Louisiana, USA, February 1996.

[MAG+97] Jason McHugh, Serge Abiteboul, Roy Goldman, Dallan Quass, and Jennifer Widom. Lore: A Database Management System for Semistructured Data. In SIGMOD Record, volume 26(3), pages 54-66, September 1997.

[Pfe95] Ulrich Pfeifer. freeWAIS-sf. Universität Dortmund, October 1995. Manual of the enhanced freeWAIS distribution.

[PFH95] Ulrich Pfeifer, Norbert Fuhr, and Tung Huynh. Searching Structured Documents with the Enhanced Retrieval Functionality of freeWAIS-sf and SFgate. In Proceedings of The Third International World-Wide Web Conference, Darmstadt, Germany, April 1995.

[SBB+99a] Steffen Staab, Christian Braun, Ilvio Bruder, Antje Düsterhöft, Andreas Heuer, Meike Klettke, Günter Neumann, Bernd Prager, Jan Pretzel, Hans-Peter Schnurr, Rudi Studer, Hans Uszkoreit, and Burkhard Wrenger. A System for Facilitating and Enhancing Web Search. In IWANN '99: Proceedings of International Working Conference on Artificial and Natural Neural Networks, Alicante, ES, 1999.

[SBB+99b] Steffen Staab, Christian Braun, Ilvio Bruder, Antje Düsterhöft, Andreas Heuer, Meike Klettke, Günter Neumann, Bernd Prager, Jan Pretzel, Hans-Peter Schnurr, Rudi Studer, Hans Uszkoreit, and Burkhard Wrenger. GETESS: Searching the Web Exploiting German Texts. In M. Klusch, O. Shehory, and G. Weiss, editors, Cooperative Information Agents III, Proceedings 3rd International Workshop CIA-99, volume 1652. Springer Verlag, July 1999.

[SQL95] ISO Working Draft, SQL Multimedia and Application Packages (SQL/MM), Part 2: Full-Text, September 1995.

[SQL99a] ANSI X3H2-99-078/WG3:YGJ-010, (ANSI-ISO Working Draft), Framework (SQL/Framework), March 1999.

[SQL99b] ANSI X3H2-99-079/WG3:YGJ-010, (ANSI-ISO Working Draft), Foundation (SQL/Foundation), March 1999.